Identifying and removing spurious links in complex networks is a meaningful problem for many
real applications and is crucial for improving the reliability of network data, which in turn can
lead to a better understanding of the highly interconnected nature of various social, biological and
communication systems. In this work we study the features of several simple spurious link
elimination methods and show that they may distort the structural and dynamical properties of
networks. Accordingly, we propose a hybrid method which combines a similarity-based index with
edge-betweenness centrality. We show that our method can effectively eliminate spurious
interactions while leaving the network connected and preserving its functionality.

PACS numbers: 89.75.Hc, 89.75.-k, 89.20.-a 



I. INTRODUCTION 

Many social, biological and information systems are naturally described by networks, where
nodes represent individuals, proteins, genes, computers, web pages, and so on, and links
denote the relations or interactions between nodes. Network analysis has hence become a
crucial focus in many fields including biology, ecology, technology and sociology [1].
However, the reliability of network data is not always guaranteed: constructed biological
and social networks may contain inaccurate and misleading information, resulting in missing
and spurious links [2, 3].

The problem of identifying missing interactions, known as link prediction, consists in
estimating the likelihood of the existence of a link between two nodes according to the
observed links and the nodes' attributes [4]. Link prediction has already attracted much
attention from disparate research communities due to its broad applicability. For instance,
in many biological networks (such as food webs, protein-protein interactions and metabolic
networks) the discovery of interactions is often difficult and expensive, hence accurate
predictions can reduce the experimental costs and speed up the pace of uncovering the
truth [5, 6]. Applications in social networks include the prediction of the actors
co-starring in acts [7] and of the collaborations in co-authorship networks, the detection
of the underground relationships between terrorists [8], and many others. In addition, the
process of recommending items to users can be considered as a link prediction problem in a
user-item bipartite graph [9], so that similarity-based link prediction techniques have been
applied to personalized recommendation [10]. Moreover, the link prediction approach can be
used to solve the classification problem in partially labeled networks, such as predicting
protein functions [11], detecting anomalous email [12], distinguishing the research areas of
scientific publications [13] and identifying fraudulent and legitimate users in cell phone
networks [14]. For a review of the field, see [15].

On the other hand, the problem of identifying spurious interactions has received less
attention despite its numerous potential applications. For instance, the identification of
inactive connections in social networks or spam hyperlinks in the WWW may improve the
efficiency of link-based ranking algorithms [16], and the detection of redundant
interactions in biological, communication or citation networks may find applications in
community detection, in constructing networks' backbones [17] or in other connection
optimization problems. A possible reason for the lack of effective methods to deal with this
problem is that wrongly removing a link has far more serious consequences than wrongly
adding one. If some "unexpected" links are incorrectly identified as spurious and removed
from the network, the system's structure and function may be altered significantly or even
compromised. For instance, the network may break up into separate components so that the
system's functionality is destroyed. In power grids, only the power plants in the giant
component can work [18]. In traffic systems, only the cities in the giant component can
mutually communicate [19]. In neural systems, only neurons in the giant component can reach
a synchronized state and effectively process signals [20]. The main challenge for a spurious
link detection method is hence to identify the spurious interactions and at the same time to
construct a network whose functionality is close to that of the original one.

In this work we show that many simple spurious link detection methods indeed have the
serious drawback of removing real and important links, which significantly alters the
networks' structure. Hence we propose a hybrid algorithm which combines a similarity-based
index known as common neighbors with the edge-betweenness centrality. We show that this
method can not only effectively identify and remove spurious links but also preserve the
size of the giant component and many important structural and dynamical properties of the
network.



II. METHOD 

In this section we describe our procedure to study the features and evaluate the performance
of a spurious link detection algorithm. We make use of six empirical undirected networks:
the C. elegans neural network (CE) [21], an email network (Email) [22], a scientists'
co-authorship network (SC) [23], the US political blogs' network (PB) [24], a
protein-protein interaction network (PPI) [25] and the US air transportation network
(USAir) [26]. We only consider the giant component of these real networks. Some properties
of these systems are reported in Table I. All of these networks are widely used in the
literature as model systems, hence we assume that they are "true" networks (i.e. without
spurious interactions), which we denote as $A^t$. We then add to these true networks a
fraction $f$ of spurious random connections to obtain "observed" networks, which we denote
as $A^o$, and evaluate the ability of the spurious link detection algorithm to recover the
features of the true network.

TABLE I. Features of empirical networks: number of nodes ($N$) and edges ($E$), average
degree ($\langle k \rangle$), average shortest path length ($\langle d \rangle$), clustering
coefficient ($C$), degree assortativity ($r$), degree heterogeneity
($H = \langle k^2 \rangle / \langle k \rangle^2$) and traffic congestability ($B_{\max}$).

To quantify the accuracy of the algorithm in identifying the spurious interactions we use
the standard metric of the area under the receiver operating characteristic curve (AUC)
[27]. Since the algorithm returns an ordered list of links (or equivalently gives each link
a score to quantify its reliability), the AUC represents the probability that a spurious
link is ranked lower than a true link. To obtain the value of the AUC, we randomly pick a
spurious link and a true link in the observed network $A^o$ and compare their scores. If,
among $n$ independent comparisons, the true link has a higher score than the spurious link
$n'$ times and an equal score $n''$ times, the AUC value is:

$$\mathrm{AUC} = \frac{n' + n''/2}{n}$$

Note that if links were ranked at random, the AUC value would be equal to 0.5.
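
The sampling procedure above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `auc_by_sampling`, the dictionary-based inputs and the default number of comparisons are assumptions made here, not part of the original study.

```python
import random

def auc_by_sampling(scores, spurious, n=10000, seed=0):
    """Estimate the AUC as (n' + n''/2) / n from n random comparisons.

    scores   -- dict mapping each link of the observed network to its reliability
    spurious -- set of the links that were added artificially
    (both containers are hypothetical; any hashable link label works)
    """
    rng = random.Random(seed)
    spurious_links = [e for e in scores if e in spurious]
    true_links = [e for e in scores if e not in spurious]
    higher = equal = 0
    for _ in range(n):
        s_true = scores[rng.choice(true_links)]
        s_spur = scores[rng.choice(spurious_links)]
        if s_true > s_spur:
            higher += 1      # counts towards n'
        elif s_true == s_spur:
            equal += 1       # counts towards n''
    return (higher + 0.5 * equal) / n
```

A perfect ranking (every true link scored above every spurious one) gives 1, while identical scores for all links give 0.5, in line with the random baseline mentioned in the text.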

As stated in the introduction, high accuracy is not sufficient for a spurious link detection
algorithm: if just a few real important links are removed, the structural and dynamical
properties of the network may change dramatically. A simple example can be seen in Fig. 1.
If the dashed link is removed, the network will break into two separate components. To study
the robustness of the algorithm in this respect, we remove from the observed network the
fraction $f'$ of the bottom-ranked links to obtain the "reconstructed" network, which we
denote as $A^r$. We then compare the structure and functionality of the true and
reconstructed networks. We will focus mainly on the giant component's (GC) size, which is of
great importance for the functionality of many real systems. Then we will consider the
clustering coefficient [28], the average shortest path length, the traffic congestability
[29] (i.e. the maximum betweenness centrality in the network) and other dynamical
properties. We will first study the case of $A^t$ and $A^r$ having the same number of links
($f' = f$). However, as in general one does not know how many spurious links there are in a
given network, we will finally consider the situation where $f' \neq f$.

FIG. 1. A simple example to illustrate how an improper spurious link removal method can
disconnect a network.
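
The perturb-and-reconstruct procedure described in this section can be sketched as follows. The helper names (`add_spurious_links`, `remove_bottom_ranked`, `giant_component_size`) and the choice of passing link counts rather than the fractions $f$ and $f'$ are assumptions made for illustration; ties between equally scored links are broken arbitrarily here, which the text does not specify.

```python
import random
from collections import deque

def add_spurious_links(nodes, true_links, n_add, seed=0):
    """Add n_add random links between node pairs that are not yet connected."""
    rng = random.Random(seed)
    nodes = list(nodes)
    true_links = {frozenset(e) for e in true_links}
    n_pairs = len(nodes) * (len(nodes) - 1) // 2
    n_add = min(n_add, n_pairs - len(true_links))  # cannot exceed the free pairs
    spurious = set()
    while len(spurious) < n_add:
        e = frozenset(rng.sample(nodes, 2))
        if e not in true_links:
            spurious.add(e)
    return true_links | spurious, spurious

def remove_bottom_ranked(scores, n_remove):
    """Keep all links except the n_remove with the lowest reliability score."""
    ranked = sorted(scores, key=scores.get)  # ascending reliability
    return set(ranked[n_remove:])

def giant_component_size(nodes, links):
    """Number of nodes in the largest connected component (breadth-first search)."""
    adj = {u: set() for u in nodes}
    for e in links:
        u, v = tuple(e)
        adj[u].add(v)
        adj[v].add(u)
    seen, best = set(), 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            u = queue.popleft()
            size += 1
            for w in adj[u] - seen:
                seen.add(w)
                queue.append(w)
        best = max(best, size)
    return best

# Toy network in the spirit of Fig. 1: two triangles joined by one bridge (2, 3).
nodes = range(6)
links = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
scores = {frozenset(e): 1.0 for e in links}
scores[frozenset((2, 3))] = 0.0          # a method that ranks the bridge last
kept = remove_bottom_ranked(scores, 1)
print(giant_component_size(nodes, kept))  # 3
```

In this toy case a single misjudged link halves the giant component, which is the failure mode the GC-size comparison is meant to expose.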

III. RELIABILITY INDICES 

In this section we describe some representative spurious link detection methods. These
algorithms assign to each link in $A^o$ a "reliability" index (denoted as $R_{ij}$ for the
link connecting nodes $i$ and $j$) which quantifies the likelihood of its true existence and
allows for link ranking.

Similarity-based indices use the network's structure to assign to each pair of connected
nodes $i$, $j$ a score which is directly defined as their similarity, with the underlying
assumption that a connection between similar nodes is likely to be a true one. These
algorithms can be classified into local, quasi-local and global according to the amount of
information they need. Here are some examples:

• Common Neighbors (CN): $R^{CN}_{ij} = \|\Gamma_i \cap \Gamma_j\|$, where $\Gamma_i$ is
the set of neighbors of node $i$ and $\|\cdot\|$ indicates the number of nodes in a set.

• Resource Allocation (RA): $R^{RA}_{ij} = \sum_{k \in \Gamma_i \cap \Gamma_j} \frac{1}{\|\Gamma_k\|}$.

• Local Path (LP): $R^{LP}_{ij} = (A^2)_{ij} + \epsilon (A^3)_{ij}$, where $A$ is the
network's adjacency matrix and $\epsilon < 1$ is a free parameter.

• Katz index (Katz): $R^{Katz}_{ij} = \sum_{l=1}^{\infty} \beta^l (A^l)_{ij}$, where
$\beta$ is a free parameter which must be lower than the reciprocal of the largest
eigenvalue of $A$.
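
As a rough illustration, the four similarity indices listed above can be computed from a neighbor-set representation of the network. The function names, the `adj` dictionary and the default values of `eps` and `max_len` are assumptions of this sketch. In particular, the Katz series is truncated at `max_len` terms here, which is a practical approximation and not how the index is defined in the text.

```python
def common_neighbors(adj, i, j):
    """CN: number of neighbors shared by i and j."""
    return len(adj[i] & adj[j])

def resource_allocation(adj, i, j):
    """RA: each common neighbor k contributes 1 / degree(k)."""
    return sum(1.0 / len(adj[k]) for k in adj[i] & adj[j])

def local_path(adj, i, j, eps=0.01):
    """LP: (A^2)_ij + eps * (A^3)_ij, counting walks of length 2 and 3."""
    a2 = len(adj[i] & adj[j])
    # (A^3)_ij = sum over neighbors k of i of (A^2)_kj
    a3 = sum(len(adj[k] & adj[j]) for k in adj[i])
    return a2 + eps * a3

def katz(adj, i, j, beta, max_len=50):
    """Katz: sum over l of beta^l (A^l)_ij, truncated at max_len (an approximation)."""
    walks = {i: 1}      # number of walks of the current length from i to each node
    total = 0.0
    for length in range(1, max_len + 1):
        nxt = {}
        for u, count in walks.items():
            for w in adj[u]:
                nxt[w] = nxt.get(w, 0) + count
        walks = nxt
        total += beta ** length * walks.get(j, 0)
    return total

# Hypothetical toy network: a triangle 0-1-2 with a pendant node 3 attached to 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(common_neighbors(adj, 0, 1), common_neighbors(adj, 2, 3))  # 1 0
```

The pendant link (2, 3) has no common neighbors, so any purely similarity-based index of this kind places it at the bottom of the ranking even though removing it isolates node 3.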

Centrality-based indices measure the importance of a link in the network, assuming that the
higher the link's centrality, the higher its reliability. We consider two simple indices:

• Preferential Attachment (PA): $R^{PA}_{ij} = \|\Gamma_i\| \times \|\Gamma_j\|$.

• Edge Betweenness (EB): $R^{EB}_{ij} = \sum_{m \neq n} C^{ij}_{mn} / C_{mn}$, where
$C_{mn}$ is the number of shortest paths from node $m$ to node $n$ and $C^{ij}_{mn}$ is the
number of such shortest paths passing through the link $ij$.

Clearly, CN, RA and PA are local indices. CN is 
the simplest possible measure of neighborhoods' overlap, 
while RA [30] is the best performing local index for the
purpose of link prediction. PA is the algorithm which 
requires less information. LP [3(| is instead a quasi-local 
method, as it considers local paths with wider horizon 
than CN (it also counts the number of different paths 
with length 3 connecting i and j). Finally, Katz [31] and EB methods are global indices, as
they are based on the ensemble of all paths in the network. Specifically, Katz counts the
paths between two nodes and weights them according to their length $l$, while EB is built
from the number of shortest paths from all vertices to all others that pass through the
given link.
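
For concreteness, the two centrality-based indices can be sketched as below. Edge betweenness is computed with a Brandes-style accumulation over breadth-first searches, which is one standard way to obtain it for unweighted networks and is not claimed to be the implementation used in this work. The sum runs over ordered node pairs, and the function names and the `adj` dictionary are assumptions of the sketch.

```python
from collections import deque

def preferential_attachment(adj, i, j):
    """PA: product of the degrees of the two end nodes."""
    return len(adj[i]) * len(adj[j])

def edge_betweenness(adj):
    """EB of every link: sum over ordered pairs (m, n), m != n, of the fraction
    of shortest m-n paths that pass through the link."""
    eb = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # Breadth-first search from s: shortest-path counts and predecessors.
        dist, sigma, preds = {s: 0}, {s: 1}, {s: []}
        order, queue = [], deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in adj[u]:
                if w not in dist:
                    dist[w], sigma[w], preds[w] = dist[u] + 1, 0, []
                    queue.append(w)
                if dist[w] == dist[u] + 1:
                    sigma[w] += sigma[u]
                    preds[w].append(u)
        # Accumulate path fractions, starting from the nodes farthest from s.
        delta = {u: 0.0 for u in order}
        for w in reversed(order):
            for u in preds[w]:
                share = sigma[u] / sigma[w] * (1.0 + delta[w])
                eb[frozenset((u, w))] += share
                delta[u] += share
    return eb

# Hypothetical toy network: two triangles joined by the bridge (2, 3).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
eb = edge_betweenness(adj)
print(eb[frozenset((2, 3))], eb[frozenset((0, 1))])  # 18.0 2.0
```

The bridge carries every shortest path between the two triangles, so it receives by far the largest EB score, whereas its common-neighbor count is zero. This is the complementarity that the hybrid index relies on.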




FIG. 2. (Color online) AUC for various indices and for different values of $f$. The true
networks are (a) CE, (b) Email, (c) SC, (d) PB, (e) PPI, (f) USAir. Results are averaged
over 100 independent realizations. Note that the curves for EB are not shown as the
respective AUC values are too low. The same holds for PA in panel (c).
IV. HYBRID INDEX

We now introduce a hybrid index which combines the similarity-based and the centrality-based
approaches. The underlying idea is that we consider a link to be a "true" one either if it
connects similar nodes or if it has a central position in the network. Even if this
assumption is not necessarily true, as we will show later it avoids the removal of important
links so that the network's properties and functions are preserved, with the small drawback
of failing to identify a few spurious interactions.

To construct the Hybrid index, we combine the simple common neighbors index with
edge-betweenness centrality as:

$$R^{Hyb}_{ij} = \lambda \frac{R^{CN}_{ij}}{\max_{mn}(R^{CN}_{mn})} + (1 - \lambda) \frac{R^{EB}_{ij}}{\max_{mn}(R^{EB}_{mn})}$$

where $\lambda \in [0, 1]$ is the hybridization parameter. In what follows we set
$\lambda = 0.9$, because we want to exploit mainly CN and a small contribution from EB will
suffice for our purposes (however, see Section VI for a study of the index behavior for
different $\lambda$). Note that this is only one possible way of defining such an index. We
made use of CN because it is the best known of the similarity-based indices. However, one
could use e.g. RA or Katz instead, and the qualitative features of the Hybrid method would
not change.
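
A minimal sketch of the hybrid combination, assuming the CN and EB scores of every link are already available as dictionaries keyed by link. The function name `hybrid_index`, the zero-maximum guard and the three toy links with their scores are hypothetical and serve only to show the effect of the mixing parameter.

```python
def hybrid_index(cn, eb, lam=0.9):
    """Mix max-normalized CN and EB scores with weight lam on CN."""
    max_cn = max(cn.values()) or 1   # guard: avoid dividing by zero if all CN are 0
    max_eb = max(eb.values()) or 1
    return {e: lam * cn[e] / max_cn + (1 - lam) * eb[e] / max_eb for e in cn}

# Hypothetical scores: an intra-community link, a bridge between communities,
# and a random (spurious) link. The last two both have no common neighbors.
cn = {"internal": 3, "bridge": 0, "spurious": 0}
eb = {"internal": 10.0, "bridge": 40.0, "spurious": 4.0}

hyb = hybrid_index(cn, eb, lam=0.9)
ranking = sorted(hyb, key=hyb.get)   # least reliable first
print(ranking)  # ['spurious', 'bridge', 'internal']
```

With CN alone the bridge and the spurious link are tied at zero, so the bridge may be removed first. The small EB contribution is enough to lift the bridge above the spurious link without changing the position of the well-embedded internal link.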


V. RESULTS

In this section we compare the features of the spurious link detection approaches which have
been previously introduced. We start by adding to the true networks $A^t$ a fraction $f$ of
random connections to obtain the observed networks $A^o$. For each particular index, we rank
the links according to their reliability values and measure the accuracy of the method in
identifying spurious interactions by the AUC (Figure 2). We observe that generally the
similarity-based methods perform better than the centrality-based ones. Among the first
category, Katz and LP [33] perform slightly better than CN and RA as they take advantage of
using more information. Among the second, EB is the worst performing, with AUC even lower
than 0.5. The performance of the Hybrid method is instead very close to that of the pure
similarity-based indices. Hence having a contribution from EB in the hybridization does not
result in worse spurious link detection (as one might expect).

We already argued that accuracy is not the only criterion to assess the performance of these
methods. The other important aspect is that the removal of putative spurious links should
not alter the giant component's size or other properties of the networks. To investigate
this aspect, we remove from $A^o$ the fraction $f'$ of the bottom-ranked links to obtain the
reconstructed networks $A^r$, whose features we compare with the ones of the respective true
networks $A^t$. We start with the simple case $f' = f$ and we first focus on the GC's size,
which is of great relevance in many contexts. As shown in Figure 3, the GC's size
significantly decreases with $f'$ when using any similarity-based method (as well as PA): in
these cases many nodes become disconnected from the networks' core and end up losing their
function. On the contrary, EB always keeps the networks connected. This is not surprising,
as it has already been pointed out [33] that similarity indices and EB are highly
anti-correlated, meaning that removing links between non-similar nodes causes links with
high betweenness to be cut, and vice versa. What is remarkable is that the Hybrid method can
also effectively preserve the connectedness of the original networks in most of the cases,
and in general much better than any similarity-based method, despite the small contribution
it receives from EB. It is hence sufficient to slightly increase the reliability of central
and important links to avoid removing them.

FIG. 3. (Color online) GC's size when various indices are used to build $A^r$ (here
$f' = f$) and for different $f$. The true networks are (a) CE, (b) Email, (c) SC, (d) PB,
(e) PPI, (f) USAir. Results are averaged over 100 independent realizations.

We move further by considering other network properties. In order to compare the true and
the reconstructed networks under a given property $X$, we compute the relative error of $X$
as $(X(A^r) - X(A^t))/X(A^t)$. As a benchmark, we also compute the relative error of $X$ in
the observed networks as $(X(A^o) - X(A^t))/X(A^t)$. For an effective spurious link removal
method, which is able to reproduce the properties of the true network, the absolute value of
the relative error for $A^r$ should be smaller than the absolute value of the relative error
for $A^o$ (meaning that $A^r$ is a better estimate of $A^t$ than $A^o$) and as close as
possible to zero (meaning that $X$ has approximately the same value in $A^t$ and $A^r$).
Figure 4 shows the relative errors made by CN and Hybrid methods for clustering coefficient,
average shortest path length and traffic congestability (i.e. the maximum betweenness
centrality in the network). We only report the results for the Political blog (PB) and US
Airline (USAir) networks, as these are the cases in which the GC's size is relatively more
affected when using pure similarity-based methods (Figure 3). We observe that in these cases
the Hybrid method always brings the properties closer to those of the true network than they
are in the observed network, while this is not always true for CN. Moreover, the Hybrid
method always preserves the networks' properties better than CN, at the small cost of
achieving smaller AUC values. This is because CN and other similarity-based methods alter
the GC, which is much more harmful for the networks' properties and functions than keeping a
few more spurious links. Note however that if the CN method does not cause serious enough
damage to the GC, as happens for the C. elegans neural (CE) and scientists' co-authorship
(SC) networks, then the situation may be reversed: CN can preserve some of the network
properties better than the Hybrid method due to its higher accuracy.

FIG. 4. (Color online) Relative errors of clustering coefficient (a)-(b), average shortest
path length (c)-(d) and transportation congestability (e)-(f) for different $f$. The
different lines correspond to the relative errors in $A^o$ and in the two $A^r$ built by CN
and Hybrid methods respectively, with $f' = f$. Left plots refer to PB while right plots to
USAir. Results are averaged over 100 independent realizations.
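
The comparison criterion used in this part of the analysis amounts to two relative errors and one inequality. The snippet below is a small sketch of it; the function names and the clustering-coefficient values are hypothetical and are not results from the paper.

```python
def relative_error(x_estimate, x_true):
    """(X(A) - X(A^t)) / X(A^t) for a property X measured on a network A."""
    return (x_estimate - x_true) / x_true

def improves_on_observation(x_true, x_observed, x_reconstructed):
    """True if the reconstructed network is closer to the true one than the
    observed network is, for the given property."""
    return abs(relative_error(x_reconstructed, x_true)) < abs(
        relative_error(x_observed, x_true))

# Hypothetical clustering coefficients of the true, observed and reconstructed networks.
c_true, c_observed, c_reconstructed = 0.30, 0.24, 0.29
print(improves_on_observation(c_true, c_observed, c_reconstructed))  # True
```

Keeping the sign of the relative error, as in the definition above, also shows whether a method overshoots or undershoots the true value of the property.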

There are plenty of other static and dynamical network properties which can be considered,
such as synchronization, spreading threshold, and so on. As these dynamics can only take
place in the GC, similarity-based methods which break the network into pieces alter them
seriously. For example, the nodes outside the GC can never reach the global synchronized
state, and a signal from the GC can never spread to these nodes. Again, these methods
eventually destroy the system's functions.
FIG. 5. (Color online) The GC's size when different fractions of links $f'$ are removed from
$A^o$ by CN, Hybrid and EB methods. The true networks are (a) CE, (b) Email, (c) SC, (d) PB,
(e) PPI, (f) USAir. Results are averaged over 100 independent realizations.



FIG. 6. (Color online) The residual fraction of spurious links in $A^r$ when different
fractions of links $f'$ are removed from $A^o$ by CN, Hybrid and EB methods. The true
networks are (a) CE, (b) Email, (c) SC, (d) PB, (e) PPI, (f) USAir. Results are averaged
over 100 independent realizations.



As in real applications of spurious link removal one does not know the exact number of
spurious links in a network, we finally consider the case when $f' \neq f$. To do so, we fix
the number of random connections added to $A^t$ at $f = 10\%$. We then study the properties
of the networks $A^r$ reconstructed by different methods by removing different fractions
$f'$ of links from $A^o$.

Figure 5 shows the GC's size for varying $f'$. We observe that the GC's size naturally
decreases with the fraction of removed links. Such decrease is very fast when using CN and
very slow when using EB; in the latter case, the GC's size is preserved in any network even
when half of the links are removed. The Hybrid method lies between these two, and remarkably
it performs like EB when the fraction of removed links is not too big (in many cases the
GC's size has a plateau which may last up to large $f'$).

Another interesting aspect is how many of the original $f$ spurious links are left in the
networks for various $f'$. Results are shown in Figure 6. We again observe that the more
links we remove, the higher the probability of removing a spurious link. Due to its low
accuracy, EB must remove almost all links in order to get rid of the spurious ones. On the
contrary, CN can eliminate all the spurious links quite soon ($f' \simeq 25\%$).
Interestingly, the Hybrid method performs as well as CN and their curves almost overlap.
These results again indicate that the Hybrid method represents an effective approach both to
preserve the GC's size and to achieve high accuracy. Moreover, it is also more robust than
other methods when considering the intrinsic uncertainty of the number of spurious
interactions in a system.



VI. THE HYBRIDIZATION PARAMETER 

Finally, we show how the Hybrid index behaves by varying the value of the parameter
$\lambda$. In order to do so, we consider the particular case in which the observed networks
$A^o$ are obtained from the true networks $A^t$ with the addition of $f = 20\%$ of spurious
links. Figure 7 shows the AUC and the GC's size of the networks $A^r$ reconstructed by the
Hybrid method (with $f' = f$) for different values of $\lambda$. We observe that while the
AUC decreases for decreasing $\lambda$ (but this decrease is always slower at the
beginning), the GC remains almost intact except when $\lambda$ becomes too close to 1.
Therefore it is sufficient to have a small contribution from EB in the Hybrid method to keep
the network connected at the cost of being slightly less accurate. This is the reason why we
have previously set $\lambda = 0.9$. Note that one can always use a bigger value of
$\lambda$ if accuracy is the main goal, or a smaller value if the GC's integrity is a major
issue.

VII. DISCUSSION 

How to detect and remove spurious interactions in networks is a significant problem which
may find application in almost any field of complex science. Still, it has not yet attracted
much attention, as the consequences of a removal error can heavily harm the system under
investigation. In the literature many similarity-based methods for the purpose of link
prediction have been proposed. In this work we showed that, when applied to spurious link
detection, all these methods achieve high accuracy but suffer from the important drawback of
decreasing the size of the giant component and distorting other static and dynamic
properties of the network. This harmful effect may cause a system to lose its functions, as
nodes which are disconnected from the GC cannot communicate with the network's core. In
order to overcome these drawbacks, we proposed a hybrid method which combines the
similarity-based common neighbors index with edge-betweenness centrality. We showed that
this approach can effectively eliminate the spurious links and at the same time keep the
network connected; moreover, important properties like clustering coefficient, average
shortest path length and traffic congestability are generally better preserved. This method
is even more advantageous when the number of spurious interactions within a system is
unknown.

FIG. 7. (Color online) The size of the GC in the networks reconstructed by the Hybrid method
with different values of $\lambda$. Insets: the AUC for different $\lambda$. The respective
true networks are (a) CE, (b) Email, (c) SC, (d) PB, (e) PPI, (f) USAir. Results are
averaged over 100 independent realizations.



In the literature there are other important examples 
of spurious link detection approaches (e.g. hierarchical 
random graph Q and stochastic block model [13]) which 
however did not focus on preserving the giant component's size. Moreover, these methods are
based on global algorithms which can be prohibitive to use for large-scale systems. Our
method would instead be easily applicable to large networks. This is because it combines the
common neighbors index, which requires only local information about a link, and
edge-betweenness centrality, whose computational complexity can now be as low as $O(NE)$,
where $N$ and $E$ are respectively the number of nodes and edges in the network [35].



Finally, we remark that the problem of identifying spurious interactions is much more
difficult to deal with than predicting missing interactions. We already pointed out how
serious a removal error may be. In addition, while in link prediction studies there is a
true network from which some existing links are removed to generate the observation and test
the algorithm, for spurious link detection it is generally unknown how spurious interactions
should be added to the true network. In this work we explored the simplest situation, in
which spurious links are just random connections between nodes. This approach can be
suitable for describing some systems (for instance biological networks obtained from
measurements prone to random errors, or social networks in which some links result from
once-in-a-lifetime interactions between people) but may be inadequate for others (like
biological systems when measurements are prone to systematic errors, or the WWW where spam
hyperlinks always start from the same set of pages). The effectiveness of a spurious link
detection method in these systems hence deserves further validation, which will be the
subject of future work.